
Conversation


@milesial milesial commented Jan 3, 2023

What does this PR do?

When training with native AMP and a LR scheduler, we get this warning, which indicates that an LR step has been taken while an optimizer step was skipped (expected at the beginning of training with native AMP):

/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
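For context, a minimal reproduction sketch (not code from this PR, and assuming a CUDA device) of how native AMP can skip the first optimizer steps while the scheduler still advances, which triggers the warning above:

```python
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast("cuda"):
        loss = model(torch.randn(2, 4, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # may be skipped while the loss scale calibrates (inf/NaN grads)
    scaler.update()
    scheduler.step()  # stepping unconditionally here is what emits the UserWarning above
```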

Fixes #16228 #5558

Does your PR introduce any breaking changes? If yes, please list them.

No

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@milesial milesial requested a review from justusschock as a code owner January 3, 2023 20:22
@milesial
Author

milesial commented Jan 3, 2023

In the process of fixing tests, I discovered and fixed a bug where a scheduler wouldn't match its optimizer when multiple optimizers are instantiated with frequencies. Now the optimizers and schedulers match and alternate as they should, resetting the cycle every epoch.
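For reference, a hedged sketch of the kind of multi-optimizer setup with frequencies this refers to, under the automatic-optimization API the PR targets (class, attribute, and optimizer names are illustrative, not taken from the PR or its tests):

```python
import torch
import pytorch_lightning as pl

class TwoOptimizerModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model_a = torch.nn.Linear(4, 4)
        self.model_b = torch.nn.Linear(4, 4)

    def configure_optimizers(self):
        # Two optimizers with frequencies: optimizer 0 is used for 2 batches,
        # then optimizer 1 for 1 batch, and the cycle repeats. Each scheduler
        # should step only together with its own optimizer.
        opt_a = torch.optim.SGD(self.model_a.parameters(), lr=0.1)
        opt_b = torch.optim.SGD(self.model_b.parameters(), lr=0.01)
        sched_a = torch.optim.lr_scheduler.StepLR(opt_a, step_size=10)
        sched_b = torch.optim.lr_scheduler.StepLR(opt_b, step_size=10)
        return (
            {"optimizer": opt_a, "lr_scheduler": sched_a, "frequency": 2},
            {"optimizer": opt_b, "lr_scheduler": sched_b, "frequency": 1},
        )
```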

@milesial
Author

milesial commented Jan 9, 2023

@carmocca Ready for final review

    return
active_optimizers = _get_active_optimizers(
-   self.trainer.optimizers, self.trainer.optimizer_frequencies, self.total_batch_idx
+   self.trainer.optimizers, self.trainer.optimizer_frequencies, self.batch_idx
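To illustrate the change above, a simplified sketch of how frequency-based selection cycles through optimizers (assumed behavior, not the actual _get_active_optimizers implementation); using the epoch-local batch_idx instead of total_batch_idx means the cycle restarts at each epoch:

```python
from typing import List

def pick_active_optimizer(frequencies: List[int], batch_idx: int) -> int:
    """Illustrative only: return the index of the optimizer active for this batch."""
    pos = batch_idx % sum(frequencies)
    for opt_idx, freq in enumerate(frequencies):
        if pos < freq:
            return opt_idx
        pos -= freq
    raise RuntimeError("unreachable")

# With frequencies [2, 1]: two batches on optimizer 0, one on optimizer 1, repeat.
assert [pick_active_optimizer([2, 1], i) for i in range(6)] == [0, 0, 1, 0, 0, 1]
```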
Contributor

Could you add a test to verify this works properly?

Author

I modified the third case of test_step_scheduling_for_multiple_optimizers_with_frequency so that it tests this.

@awaelchli awaelchli added community This PR is from the community optimization labels Jan 9, 2023
@carmocca carmocca requested a review from Borda as a code owner January 10, 2023 16:13
Contributor

@carmocca carmocca left a comment

Can you check the failing tests?

setup.cfg (Outdated)

  cloud:Run the cloud tests for example
filterwarnings =
  error::FutureWarning
+ error:Detected call of `lr_scheduler.step\(\)` before `optimizer.step\(\)`:UserWarning
Contributor

I added this line so that our CI fails if this warning appears. This way it tests that your patch works as expected.
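For illustration, this is roughly the mechanism the setup.cfg entry relies on, shown as a standalone pytest example using a per-test marker instead of the ini option (the test itself is hypothetical, not part of the PR):

```python
import warnings
import pytest

@pytest.mark.filterwarnings("error:Detected call of:UserWarning")
def test_scheduler_warning_becomes_an_error():
    # With the "error" action, any matching UserWarning is raised as an exception,
    # so a test (or CI run) that triggers it fails instead of silently passing.
    with pytest.raises(UserWarning):
        warnings.warn(
            "Detected call of `lr_scheduler.step()` before `optimizer.step()`.",
            UserWarning,
        )
```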

Author

Thanks, but this also makes the IPU tests fail, and this PR is focused on GPU. I'm not sure where to fix the issue for IPUs.

Contributor

@milesial IMHO, we could just skip the IPU test failures as done in my old PR #11755 for now.

@milesial
Author

milesial commented Jan 10, 2023

The way I fixed the tests/tests_pytorch/models/test_hooks.py::test_trainer_model_hook_system_fit[True-kwargs1] test is very flaky, so I'd appreciate it if someone more familiar with these tests could come up with a better fix.
edit: it seems like it didn't even fix it...

@milesial
Author

I also modified native_amp.py in both lightning_fabric and pytorch_lightning. It doesn't seem like the lightning_fabric one is called in a typical workflow, so I'm not sure whether that's the right approach.
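For context, a common user-level workaround pattern for this issue, sketched under the assumption that a decreasing loss scale indicates a skipped step (this is not necessarily what the PR's native_amp.py change does):

```python
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast("cuda"):
        loss = model(torch.randn(2, 4, device="cuda")).sum()
    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()
    scaler.step(optimizer)      # skipped if inf/NaN gradients were found
    scaler.update()             # lowers the scale when the step was skipped
    if scaler.get_scale() >= scale_before:
        scheduler.step()        # advance the LR schedule only after a real optimizer step
```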

@Lightning-AI Lightning-AI deleted a comment from stale bot Apr 25, 2023
@Borda
Contributor

Borda commented Apr 25, 2023

> Hi @Borda, I also encounter the same issue. Will this be merged?

Let me check what is missing here...

@mergify mergify bot removed the has conflicts label Apr 26, 2023
@alimoezzi

Is this PR merged already? I'm still having this issue.

@Borda
Contributor

Borda commented May 23, 2023

There were some failing tests. @milesial, mind having a look?


gitguardian bot commented Nov 18, 2023

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend revoking them.
While these secrets were previously flagged, we no longer have a reference to the
specific commits where they were detected. Once a secret has been leaked into a git
repository, you should consider it compromised, even if it was deleted immediately.
You can find more information about the risks here.




codecov bot commented Nov 18, 2023

Codecov Report

❌ Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 48%. Comparing base (1fc077b) to head (5bff40a).

❌ Your patch check has failed because the patch coverage (29%) is below the target coverage (50%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (48%) is below the target coverage (99%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (1fc077b) and HEAD (5bff40a). Click for more details.

HEAD has 378 uploads less than BASE

Flag               BASE (1fc077b)   HEAD (5bff40a)
cpu                108              27
python3.10         24               6
lightning_fabric   26               14
pytest             57               30
lightning          61               16
python3.11         24               6
python             12               3
python3.12         12               3
python3.12.7       36               9
pytorch2.1         12               0
pytest-full        54               0
pytorch_lightning  24               0
pytorch2.7         6                0
pytorch2.4.1       6                0
pytorch2.5.1       6                0
pytorch2.2.2       6                0
pytorch2.6         6                0
pytorch2.8         6                0
pytorch2.3         6                0
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #16229     +/-   ##
=========================================
- Coverage      87%      48%    -39%     
=========================================
  Files         269      266      -3     
  Lines       23656    23606     -50     
=========================================
- Hits        20633    11419   -9214     
- Misses       3023    12187   +9164     

@mergify mergify bot removed the has conflicts label Feb 16, 2024
@carmocca carmocca removed their assignment May 6, 2024
Copy link

stale bot commented Apr 26, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help, see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask for the assistance of a core contributor here or on Discord. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Apr 26, 2025
@stale stale bot removed won't fix This will not be worked on labels Aug 19, 2025
@Borda Borda added the waiting on author Waiting on user action, correction, or update label Sep 5, 2025

Labels

  • community: This PR is from the community
  • fabric: lightning.fabric.Fabric
  • optimization
  • pl: Generic label for PyTorch Lightning package
  • waiting on author: Waiting on user action, correction, or update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong LR scheduler behaviour when using AMP

8 participants